SpaN is a company that provides an online portal for real estate services. The main functionality of the portal is that property listings are published by real estate agents and visitors can search for properties based on a set of search criteria.
As a Data Scientist for SpaN, you closely cooperate with various departments to assist in decision making and strategy execution, based on actionable insights that you get out of data.
The following assignment consists of 3 parts. Each part has specific requirements and deliverables as described in the corresponding sections. The implementation needs to be either in Python or in R. After you complete the assignment, you need to provide a link to a public git repository that includes your code (in one file or several folders).
The given dataset is this month’s snapshot of all the listings of residential properties (houses) for sale listed on the portal of SpaN for 4 specific areas of Athens (geography_name). Each listing has a unique id, but the same actual property may be uploaded by multiple real estate agents, so multiple different listings (with different ids) can exist for the same property. Each agent is identified by a unique agent_id.
The rank that listings are ordered by in a specific area when a user makes a search depends on the type of the listing (ad_type) and their listing score (ranking_score). There are four different listing types: simple listings, which appear last; up listings, which rank above simple; premium listings, which rank above up; and star listings, which appear at the top of the list. Within each listing type group, properties are ranked based on their listing score.
The size of the property (sq_meters), its price and the area (geography_name) are the main search filters that users use in their initial search. The rest of the columns of the dataset are further attributes of the listed properties and can be used as filters in a search. The year_of_construction column represents the year the house was built; the value 2155 means that the house is under construction.
The marketing department wants to issue a press release regarding house prices in different areas of Athens. They ask if you could help them by providing some useful statistics that would show the price levels of houses for sale this month that real estate journalists would find useful. For this purpose, you will need to calculate tables that show some metrics, namely the mean, median and standard deviation of property prices per house type (subtype) and per area (geography_name). Before you calculate the final metrics, keep in mind that you should clean the dataset of any entries that you think should not be included in the analysis, because they would distort the picture of the price levels in each area.
After examining the dataset for this part, I decided to remove rows containing values not applicable to this month, as well as outlier prices.
The logic behind removing rows is as follows:
First of all, everything concerning under-construction properties was removed from the analysis. This task requires an analysis of this month’s prices, so including under-construction (future) prices did not seem realistic. Then, listings priced below 1,000 euros were removed as unrealistic too.
Next, outlier prices were removed: prices outside the Q1 − 1.5·IQR to Q3 + 1.5·IQR range were treated as outliers, but not over the entire dataset. Outliers were removed separately within each geography_name group.
Explanation:
If outlier prices were removed across the entire dataset, roughly 0% of beesy neighborhood listings would be removed and about 1% for the gentrification area, but 8% and 10% for south beach and northern sub respectively.
This probably happens because these two areas are more expensive by default, so it stands to reason that they contribute most of the outliers across the dataset. But since we need an analysis by geography, it seemed fairer to remove outliers within each geography_name.
Applying the same removal method within each geography_name, the percentages removed are: beesy neighborhood (8%), gentrification area (9.5%), northern sub (7.5%), south beach (8.7%), which is much more balanced.
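The cleaning steps above can be sketched with pandas; this is a sketch assuming the column names price, geography_name and year_of_construction from the dataset description, and `clean_prices` is a hypothetical helper name:

```python
import pandas as pd

def clean_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Drop under-construction listings, unrealistically low prices,
    and per-area price outliers outside the 1.5 * IQR fences."""
    # year_of_construction == 2155 flags under-construction houses.
    df = df[(df["year_of_construction"] != 2155) & (df["price"] >= 1000)]
    # Quartiles are computed within each area, not over the whole dataset.
    q1 = df.groupby("geography_name")["price"].transform(lambda s: s.quantile(0.25))
    q3 = df.groupby("geography_name")["price"].transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    inside = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[inside]
```

Using `transform` keeps the quartiles aligned to each row, so the per-area fences are applied in one vectorized filter.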
After excluding the rows that match the aforementioned criteria (under construction, unrealistically low prices, outliers), we end up with around 88% of the starting raw dataset and can start exploring the prices in the requested groups:
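Assuming the cleaned data sit in a pandas DataFrame with the column names from the dataset description, the requested tables reduce to a grouped aggregation; a minimal sketch (`price_stats` is a hypothetical helper name):

```python
import pandas as pd

def price_stats(df: pd.DataFrame) -> pd.DataFrame:
    # Mean, median and standard deviation of price per house type and area.
    return (
        df.groupby(["subtype", "geography_name"])["price"]
          .agg(["mean", "median", "std"])
          .round(0)
          .reset_index()
    )
```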
Comments for the Marketing Team:
The Sales Manager of SpaN, after conducting qualitative research by asking different agents in each area of Athens, wants to examine the possibility of offering special discounts for some listing types, based on the competitiveness of each area. To decide what type of discount should be given to agents in each area, she would need to see an analysis of the competitiveness of each area. A highly competitive area would mean that it would be hard for a simple listing to rank high in the search results of this area just by having a high ranking score. To help the sales manager decide the level of discount to be given to agents in each area, you would need to:
After examining the dataset for this part, I decided to create a competition metric called the opportunity score, based on 2 combined key factors. The opportunity score measures how well a simple listing would do if it became an up, premium or star listing.
This means that a high opportunity score indicates low competitiveness in the upper tiers of ad_type, so the opportunity to upgrade a simple listing to a paid tier is bigger, because according to its ranking score it would land in a good position within the selected ad_type.
On the contrary, a low opportunity score means that the area is quite competitive: even if the simple ad became paid, the upgrade would have only a minor effect on its ranking, beyond of course placing it above all other simple listings.
Having said that, here are the 2 key factors used to build the opportunity score:
Factor 1 is the scarcity of the ad_type in the area: (1 - share of ad_type in area). E.g. in beesy neighborhood there are 2182 listings, of which 34 (1.55%) are up listings, 30 (1.37%) star and 44 (2%) premium; the score for up listings would be 1 - 0.0155 = 0.9845, and so on.
Factor 2 is the ranking-score gap: (median ranking_score of simple listings / median ranking_score of up, premium or star listings) - 1. E.g. in beesy neighborhood the median simple ranking score is 116.15 and the median ranking score of up listings is 113.1, so (116.15 / 113.1) - 1 ≈ 0.027.
So the opportunity score of simple listings in beesy neighborhood for up listings would be 0.9845 / 0.027 ≈ 36.
Calculating the mean score across ad_types within a geography_name gives us the area-level opportunity score as well.
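The score as defined above can be sketched in pandas; column names are assumed from the dataset description, and `opportunity_scores` is a hypothetical helper name:

```python
import pandas as pd

def opportunity_scores(df: pd.DataFrame) -> pd.DataFrame:
    """Opportunity score of simple listings per paid tier and area:
    (1 - tier share in area) / (median simple score / median tier score - 1)."""
    rows = []
    for area, g in df.groupby("geography_name"):
        simple_med = g.loc[g["ad_type"] == "simple", "ranking_score"].median()
        for tier in ("up", "premium", "star"):
            tier_rows = g[g["ad_type"] == tier]
            scarcity = 1 - len(tier_rows) / len(g)   # factor 1
            ratio = simple_med / tier_rows["ranking_score"].median() - 1  # factor 2
            rows.append({"geography_name": area, "ad_type": tier,
                         "opportunity_score": scarcity / ratio})
    return pd.DataFrame(rows)
```

Note the score is undefined when the two medians coincide (factor 2 is zero), so in practice that edge case would need a guard.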
Comments for the Sales Team:
The product team of SpaN wants to launch a new page on the portal that would help agents decide the correct price they should set for a property for sale in Athens. The agent would need to input certain attributes of a property and an algorithm would value the property, based on historical data. The team is building an MVP (minimum viable product) that would be launched in beta, in order to measure the willingness of agents to use the new page and get feedback on the accuracy of the predictions, based on the experience of agents in the market.
They ask if you could help them identify the most important attributes an agent would have to input to get a valid prediction of a property’s price, and also build the model that would predict the value (price).
Using the data from the given dataset:
- Identify the most important attributes in predicting the price of a property.
- Build a model that values each residential property.
First of all, since the dataset should already be cleaned of price outliers, I chose to use the per-geography cleaned dataset from Part 1 that was used for the marketing report.
The split between training and test set is 80%-20% respectively.
We then perform an NA analysis to see which variables we can use as predictors of price, in order to build the model and identify the most important variables in it.
As we can see, all variables after balcony area have more than roughly 10% of their values missing, so they are all excluded from the analysis in order to keep the model-building effort small.
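The NA screening can be sketched with pandas; the 10% threshold matches the cutoff described above, and `usable_columns` is a hypothetical helper name:

```python
import pandas as pd

def usable_columns(df: pd.DataFrame, max_missing: float = 0.10) -> list:
    # Share of missing values per column, worst first.
    missing = df.isna().mean().sort_values(ascending=False)
    # Keep only columns whose missing share is at or below the threshold.
    return missing[missing <= max_missing].index.tolist()
```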
Energy class is also removed, while renovation year is kept, accepting that if the listing was not renovated it gets the value 0 instead of NA.
We also exclude irrelevant features such as id, ranking_score and agent_id, keeping only the characteristics of each house.
Model Specifics
To perform the predictions, a simple linear regression model is selected for its easy interpretability and implementation.
The numeric data were all centered and scaled, while categorical data were transformed into dummy variables, meaning a factor with 2 categories breaks up into 2 different binary columns, one for each category.
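This preprocessing plus the linear model can be bundled into a single pipeline; a sketch with scikit-learn, assuming the feature lists are supplied by the caller (`make_price_model` is a hypothetical helper name):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def make_price_model(numeric_cols, categorical_cols):
    # Center/scale numeric features; expand factors into dummy columns.
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    return Pipeline([("prep", preprocess), ("model", LinearRegression())])
```

Fitting the preprocessing inside the pipeline keeps the centering/scaling parameters learned on the training split only.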
To select the optimal number of the remaining features, an RFE (recursive feature elimination) method was implemented.
The rfe function suggests that we keep all 13 features in our model, but we can see (Table 5) that if we keep 11 we get almost the same RMSE and R-squared, and the standard deviation of R-squared is the same. So I decided to keep the 11 most important. (Practically it is even fewer than eleven (8), since the levels of geography_name become stand-alone variables after the transformation of factors to dummy variables.)
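The rfe function above is from R's caret package; the closest Python analogue is scikit-learn's RFE, sketched here on synthetic data (RFECV would additionally choose the subset size by cross-validation, closer to what caret's rfe does):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: 13 candidate features, of which 8 are informative.
X, y = make_regression(n_samples=200, n_features=13, n_informative=8,
                       random_state=0)
# Recursively drop the weakest feature until 11 remain, mirroring the
# subset size chosen above.
selector = RFE(LinearRegression(), n_features_to_select=11).fit(X, y)
kept = int(selector.support_.sum())
```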
To plot the importance of each feature I used the varImp function, which uses the absolute value of the t-statistic of each model parameter.
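What varImp reports for a linear model can be reproduced by hand; a numpy sketch of the |t|-statistics of the OLS coefficients (`t_statistics` is a hypothetical helper name):

```python
import numpy as np

def t_statistics(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Absolute t-statistics of the OLS coefficients (intercept excluded),
    i.e. the importance measure varImp uses for a linear model."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])      # add intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - p - 1)       # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(Xd.T @ Xd)))
    return np.abs(beta / se)[1:]               # drop the intercept
```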
Taking the above information into account, the final model includes (ordered by importance):
With these variables the model has an R-squared of around 0.66 (the selected features explain 66% of the variance of price, which is considerable but still leaves a lot of room for improvement) and an RMSE (root mean squared error) of 4910.12 euros, while predictions deviate from the actual price by roughly 30-35% on average.
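The three figures quoted above correspond to standard regression metrics; a small numpy sketch of how they are computed on the test set (`regression_report` is a hypothetical helper name):

```python
import numpy as np

def regression_report(y_true, y_pred) -> dict:
    # RMSE, R-squared and mean absolute percentage error.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    mape = np.mean(np.abs((y_true - y_pred) / y_true))
    return {"rmse": rmse, "r2": 1 - ss_res / ss_tot, "mape": mape}
```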
After the first beta iteration there is a lot of room for improvement:
We could